40 research outputs found

    Boosting Neural Machine Translation with Dependency-Scaled Self-Attention Network

    Syntactic knowledge is a powerful aid in neural machine translation (NMT). Early NMT work assumed that syntactic details could be learned automatically from large amounts of text via attention networks. However, subsequent research pointed out that, limited by the uncontrolled nature of attention computation, an NMT model requires external syntax to capture deep syntactic awareness. Although existing syntax-aware NMT methods have borne great fruit in incorporating syntax, the additional workload they introduce renders the model heavy and slow. In particular, these efforts scarcely involve Transformer-based NMT or modify its core self-attention network (SAN). To this end, we propose a parameter-free, dependency-scaled self-attention network (Deps-SAN) for syntax-aware Transformer-based NMT. A quantified matrix of dependency closeness between tokens is constructed to impose explicit syntactic constraints on the SAN for learning syntactic details and dispelling the dispersion of attention distributions. Two knowledge sparsing techniques are further integrated to keep the model from overfitting the dependency noise introduced by the external parser. Experiments and analyses on the IWSLT14 German-to-English and WMT16 German-to-English benchmark NMT tasks verify the effectiveness of our approach
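
The idea of scaling self-attention by dependency closeness can be illustrated with a small sketch. This is not the authors' formulation: the Gaussian closeness function, the `sigma` parameter, the post-softmax reweighting, and all function names below are assumptions chosen for illustration. The sketch assumes `heads` encodes a single connected dependency tree and adds no trainable parameters.

```python
import math
from collections import deque


def tree_distances(heads):
    """All-pairs path lengths in a dependency tree.

    heads[i] is the index of token i's head, or -1 for the root.
    Assumes the heads describe one connected tree.
    """
    n = len(heads)
    adj = [[] for _ in range(n)]
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].append(head)
            adj[head].append(child)
    dist = [[0] * n for _ in range(n)]
    for src in range(n):
        seen = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        for tgt, d in seen.items():
            dist[src][tgt] = d
    return dist


def closeness(dist, sigma=1.0):
    """Assumed Gaussian closeness: 1.0 at distance 0, decaying with distance."""
    return [[math.exp(-(d * d) / (2 * sigma * sigma)) for d in row] for row in dist]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dependency_scaled_attention(scores, heads, sigma=1.0):
    """Reweight softmax attention by dependency closeness, then renormalize."""
    weights = closeness(tree_distances(heads), sigma)
    out = []
    for score_row, weight_row in zip(scores, weights):
        scaled = [a * w for a, w in zip(softmax(score_row), weight_row)]
        total = sum(scaled)
        out.append([s / total for s in scaled])
    return out
```

With uniform raw scores, the resulting attention for each token concentrates on itself and its syntactic neighbours, which is the kind of reduced dispersion the abstract describes.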

    Correction: Predicting Health Material Accessibility: Development of Machine Learning Algorithms

    BACKGROUND: Current health information understandability research uses medical readability formulas to assess the cognitive difficulty of health education resources. This is based on an implicit assumption that medical domain knowledge, represented by uncommon words or jargon, forms the sole barrier to health information access among the public. Our study challenged this by showing that, for readers from non-English speaking backgrounds with higher education attainment, semantic features that underpin the knowledge structure of English health texts, rather than medical jargon, can explain the cognitive accessibility of health materials among readers with a better understanding of English health terms yet limited exposure to English-based health education environments and traditions. OBJECTIVE: Our study explores multidimensional semantic features for developing machine learning algorithms to predict the perceived level of cognitive accessibility of English health materials on health risks and diseases for young adults enrolled in Australian tertiary institutes. We compared algorithms to evaluate the cognitive accessibility of health information for nonnative English speakers with advanced education levels yet limited exposure to English health education environments. METHODS: We used 113 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from Australian and international health organization websites and rated by overseas tertiary students, we compared machine learning algorithms (decision tree, support vector machine [SVM], ensemble tree, and logistic regression) after hyperparameter optimization (grid search for the hyperparameter combination with minimal classification errors).
We applied 5-fold cross-validation on the whole data set for model training and testing, and calculated the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, and accuracy as measures of model performance. RESULTS: We developed and compared 4 machine learning algorithms using multidimensional semantic features as predictors. The results showed that the ensemble classifier (LogitBoost) performed best in terms of AUC (0.858), sensitivity (0.787), specificity (0.813), and accuracy (0.802). Support vector machine (AUC 0.848, sensitivity 0.783, specificity 0.791, and accuracy 0.786) and decision tree (AUC 0.754, sensitivity 0.7174, specificity 0.7424, and accuracy 0.732) followed. Ensemble classifier (LogitBoost), support vector machine, and decision tree achieved statistically significant improvement over logistic regression in AUC, sensitivity, specificity, and accuracy. Support vector machine reached statistically significant improvement over decision tree in AUC and accuracy. As the best performing algorithm, ensemble classifier (LogitBoost) reached statistically significant improvement over decision tree in AUC, sensitivity, specificity, and accuracy. CONCLUSIONS: Our study shows that cognitive accessibility of English health texts is not limited to word length and sentence length, as had been conventionally measured by medical readability formulas. We compared machine learning algorithms based on semantic features to explore the cognitive accessibility of health information for nonnative English speakers. The results showed that the new models reached statistically significant increases in AUC, sensitivity, and accuracy in predicting health resource accessibility for the target readership.
Our study illustrated that semantic features such as cognitive ability–related semantic features, communicative actions and processes, power relationships in health care settings, and lexical familiarity and diversity of health texts are large contributors to the comprehension of health information; for readers such as international students, semantic features of health texts outweigh syntax and domain knowledge
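
The four measures reported in this abstract (AUC, sensitivity, specificity, and accuracy) can be computed for a binary classifier as sketched below. The function names and the toy labels and scores in the usage note are illustrative assumptions, not data from the study; the AUC is computed with the standard pairwise-ranking definition rather than any particular library routine.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)  # true positive rate


def specificity(y_true, y_pred):
    _, tn, fp, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)  # true negative rate


def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, with made-up labels `[1, 1, 1, 0, 0, 0]` and scores `[0.9, 0.8, 0.4, 0.7, 0.3, 0.2]`, thresholding at 0.5 gives sensitivity, specificity, and accuracy of 2/3 each, while the AUC is 8/9 because only one positive-negative pair is ranked in the wrong order.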

    Automatic Construction of Semantic Dictionary for Question Categorization

    An automatic method for building a semantic dictionary from existing questions in a pattern-based question answering system is proposed for question categorization. This dictionary consists of two main parts: Semantic Domain Terms (SDT), which is a domain-specific term list, and Semantic Labeled Terms (SLT), which contains common terms tagged with semantic labels. The semantic dictionary is built using the proposed method on a set of 2509 questions with semantic patterns in our system. A further 3390 questions without semantic patterns are used as ground truth to test its performance. Experimental results show that the precision of question classification improves by 7.5% on average with the constructed semantic dictionary compared with the baseline method

    An approach for transgender population information extraction and summarization from clinical trial text

    Background: Gender information frequently appears in the eligibility criteria of clinical trial text as essential information for participant recruitment. In particular, current eligibility criteria text suffers from incompleteness and ambiguity in describing the transgender population, leading to difficulties or even failure in recruiting transgender participants for clinical trial studies. Methods: A new gender model is proposed to provide a comprehensive specification of transgender requirements. In addition, an automated approach is developed to extract and summarize gender requirements from unstructured text in accordance with the gender model. This approach consists of: 1) the feature extraction module, and 2) the feature summarization module. The first module identifies and extracts gender features using heuristic rules and automatically generated patterns. The second module summarizes gender requirements by relation inference. Results: Based on 100,134 clinical trials from ClinicalTrials.gov, our approach was compared with 20 commonly applied machine learning methods. It achieved a macro-averaged precision of 0.885, a macro-averaged recall of 0.871 and a macro-averaged F1-measure of 0.878. The results illustrated that our approach outperformed all baseline methods in terms of both commonly used metrics and macro-averaged metrics. Conclusions: This study presented a new gender model that aims to specify transgender requirements more precisely. We also proposed an approach for gender information extraction and summarization from unstructured clinical text to enhance transgender-related clinical trial population recruitment. The experimental results demonstrated that the approach was effective in transgender criteria extraction and summarization
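
Macro-averaging, as used for the precision, recall, and F1 figures above, gives every class equal weight regardless of its size. The sketch below shows one common way to compute it; the function names and the toy labels in the usage note are assumptions for illustration. The reported F1 of 0.878 is numerically consistent with the harmonic mean of the reported macro precision and recall, although whether the authors computed it that way, rather than averaging per-class F1 scores, is not stated in the abstract.

```python
def per_class_prf(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, harmonic_mean(precision, recall)


def harmonic_mean(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_average(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    rows = [per_class_prf(y_true, y_pred, label) for label in labels]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Consistency check on the figures reported in the abstract.
reported_f1 = round(harmonic_mean(0.885, 0.871), 3)
```

Calling `macro_average(["a", "a", "b", "b", "c", "c"], ["a", "b", "b", "b", "c", "a"])` on these made-up labels averages the three per-class scores, so the small class "c" counts as much as any other.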

    Health Natural Language Processing: Methodology Development and Applications

    With the rapid growth of information technology, the necessity for processing substantial amounts of health data using advanced information technologies is increasing. A large amount of valuable data exists in natural text such as diagnosis text, discharge summaries, online health discussions, and eligibility criteria of clinical trials. Health natural language processing, as an interdisciplinary field of natural language processing and health care, plays a substantial role in a wide scope of both methodology development and applications. This editorial shares the most recent methodology innovations of health natural language processing and applications in the medical domain published in this JMIR Medical Informatics special theme issue entitled "Health Natural Language Processing: Methodology Development and Applications"

    Predicting the Easiness and Complexity of English Health Materials for International Tertiary Students With Linguistically Enhanced Machine Learning Algorithms: Development and Validation Study

    Background: There is an increasing body of research on the development of machine learning algorithms in the evaluation of online health educational resources for specific readerships. Machine learning algorithms are known for their lack of interpretability compared with statistics. Given their high predictive precision, improving the interpretability of these algorithms can help increase their applicability and replicability in health educational research and applied linguistics, as well as in the development and review of new health education resources for effective and accessible health education. Objective: Our study aimed to develop a linguistically enriched machine learning model to predict binary outcomes of online English health educational resources in terms of their easiness and complexity for international tertiary students. Methods: Logistic regression emerged as the best performing algorithm compared with support vector machine (SVM) (linear), SVM (radial basis function), random forest, and extreme gradient boosting on the transformed data set using L2 normalization. We applied recursive feature elimination with SVM to perform automatic feature selection. The automatically selected features (n=67) were then further streamlined through expert review. The finalized feature set of 22 semantic features achieved a similar area under the curve, sensitivity, specificity, and accuracy compared with the initial (n=115) and automatically selected feature sets (n=67). Logistic regression with the linguistically enhanced feature set (n=22) exhibited important stability and robustness on the training data of different sizes (20%, 40%, 60%, and 80%), and showed consistently high performance when compared with the other 4 algorithms (SVM [linear], SVM [radial basis function], random forest, and extreme gradient boosting).
Results: We identified semantic features (with positive regression coefficients) contributing to the prediction of easy-to-understand online health texts and semantic features (with negative regression coefficients) contributing to the prediction of hard-to-understand health materials for readers with nonnative English backgrounds. Language complexity was explained by lexical difficulty (rarity and medical terminology), verbs typical of medical discourse, and syntactic complexity. Language easiness of online health materials was associated with features such as common speech act verbs, personal pronouns, and familiar reasoning verbs. Successive permutation of features illustrated the interaction between these features and their impact on key performance indicators of the machine learning algorithms. Conclusions: The new logistic regression model developed exhibited consistency, scalability, and, more importantly, interpretability based on existing health and linguistic research. It was found that low and high linguistic accessibilities of online health materials were explained by 2 sets of distinct semantic features. This revealed the inherent complexity of effective health communication beyond current readability analyses, which were limited to syntactic complexity and lexical difficulty
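
The interpretability argument in this abstract rests on the sign of each logistic regression coefficient after L2 normalization of the feature vector. A minimal sketch of that reading follows. The two feature names echo categories mentioned in the abstract, but the coefficient values, the intercept, and the function names are hypothetical and are not taken from the study.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def predict_easy_probability(features, coefficients, intercept=0.0):
    """Logistic regression: probability that a text is rated easy to understand."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients: positive pushes toward "easy", negative toward "hard".
FEATURES = ["personal_pronouns", "medical_terminology"]
COEFFICIENTS = [1.5, -2.0]

pronoun_heavy = l2_normalize([8.0, 1.0])  # made-up feature counts
jargon_heavy = l2_normalize([1.0, 8.0])   # made-up feature counts

p_pronoun_heavy = predict_easy_probability(pronoun_heavy, COEFFICIENTS)
p_jargon_heavy = predict_easy_probability(jargon_heavy, COEFFICIENTS)
```

Under these assumed coefficients, the pronoun-heavy text receives a probability above 0.5 and the jargon-heavy text a probability below 0.5, which is the sign-based reading the abstract describes.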

    A Bibliometric Review of Natural Language Processing Empowered Mobile Computing

    Natural Language Processing (NLP) empowered mobile computing is the use of NLP techniques in mobile environments. Research in this field has drawn much attention, as shown by the continually increasing number of publications in the last five years. This study presents the status and development trend of the research field through an objective, systematic, and comprehensive review of relevant publications available from Web of Science. Analysis techniques including a descriptive statistics method, a geographic visualization method, a social network analysis method, a latent Dirichlet allocation method, and an affinity propagation clustering method are used. We quantitatively analyze the publications in terms of statistical characteristics, geographical distribution, cooperation relationship, and topic discovery and distribution. This systematic analysis of the field illustrates the evolution of publications over time and identifies current research interests and potential directions for future research. Our work can potentially assist researchers in keeping abreast of the research status. It can also help monitor new scientific and technological developments in the research field

    Correction to: A pattern learning-based method for temporal expression extraction and normalization from multi-lingual heterogeneous clinical texts

    After publication of the original article [1], it was noted that the captions relating to Figs. 2 and 3 had been interchanged